WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition

Authors

Tony Robinson

Jeroen Fransen

David Pye

Jonathan Foote

Steve Renals

Abstract

A significant new speech corpus of British English has been recorded at Cambridge University. Derived from the Wall Street Journal text corpus, WSJCAM0 constitutes one of the largest corpora of spoken British English currently in existence. It has been specifically designed for the construction and evaluation of speaker-independent speech recognition systems. The database consists of 140 speakers each speaking about 110 utterances. This paper describes the motivation for the corpus, the processes undertaken in its construction and the utilities needed as support tools. All utterance transcriptions have been verified and a phonetic dictionary has been developed to cover the training data and evaluation tasks. Two evaluation tasks have been defined using standard 5,000 word bigram and 20,000 word trigram language models. The paper concludes with comparative results on these tasks for British and American English.


Similar resources

Issues in Large Vocabulary, Multilingual Speech Recognition

In this paper we report on our activities in multilingual, speaker-independent, large vocabulary continuous speech recognition. The multilingual aspect of this work is of particular importance in Europe, where each country has its own national language. Our existing recognizer for American English and French has been ported to British English and German. It has been assessed in the context of ...


Investigation of Indian English Speech Recognition using CMU Sphinx

In recent years, research on speech recognition has devoted much attention to the automatic transcription of speech data such as broadcast news (BN), medical transcription, etc. Large Vocabulary Continuous Speech Recognition (LVCSR) systems have been developed successfully for Englishes (American English (AE), British English (BE), etc.) and other languages, but in the case of Indian English (IE),...


Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting

Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...


Speaker Adaptation in Continuous Speech Recognition Using MLLR-Based MAP Estimation

A variety of methods are used for speaker adaptation in speech recognition. In some techniques, such as MAP estimation, only the models with available training data are updated. Hence, large amounts of training data are required in order to have significant recognition improvements. In some others, such as MLLR, where several general transformations are applied to model clusters, the results ar...


Specifics of Hidden Markov Model Modifications for Large Vocabulary Continuous Speech Recognition

Specifics of hidden Markov model-based speech recognition are investigated. The influence of modeling simple and context-dependent phones, using simple Gaussian, two- and three-component Gaussian mixture probability density functions for modeling feature distribution, and incorporating a language model are discussed. Word recognition rates and model complexity criteria are used for evaluating suitabili...



Publication date: 1995
